Web Indexing on a Diet: Template Removal with the Sandwich Algorithm

Authors

  • Tom Rowlands
  • Paul Thomas
  • Stephen Wan
Abstract

Web pages contain both unique text, which we should include in indexes, and template text, such as navigation strips and copyright notices, which we may want to discard. While algorithms exist for removing template text, most rely on first completing a crawl and then parsing each page. We present a cheap and efficient algorithm that does not parse HTML and requires only a single pass over the document. Using two web corpora, we investigated the performance of a retrieval system built on our algorithm and found similar effectiveness with an index 9-54% smaller. Further experiments on a marked-up corpus showed that 97% of desired lines are returned.

Publication date: 2009